Preparation phase

User defines the following:

• Web page content types that the method must recognize

    N (Company News)
    C (Contact information)
    P (Product information)
    M (Management team)
    D (Company description)
    ...etc...

• Set of tests that provide evidence about the content type

    T1 = "Number of external links on page > 5"
    T2 = "Number of internal links > 10"
    T3 = "Link text contains contact keywords (e.g. address, location, contact, etc)"
    T4 = "Number of people names in page > 3"
    T5 = "Page contains stock ticker symbol"
    T6 = "Page contains header starting with word 'About...'"
    ...etc...

Fig. 1 
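
The content types and tests listed in Fig. 1 could be represented as plain data and boolean predicates. The following is a minimal sketch only: the PageFeatures class, its field names, and the run_tests helper are hypothetical names introduced for illustration, and the figure does not say how link counts, names, or ticker symbols are actually extracted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical summary of a fetched page; the field names are assumptions,
# not part of the figure.
@dataclass
class PageFeatures:
    external_links: int = 0
    internal_links: int = 0
    link_texts: List[str] = field(default_factory=list)
    person_names: int = 0
    has_ticker_symbol: bool = False
    headers: List[str] = field(default_factory=list)


# Content types from Fig. 1 (the figure's list continues with "etc").
CONTENT_TYPES = {
    "N": "Company News",
    "C": "Contact information",
    "P": "Product information",
    "M": "Management team",
    "D": "Company description",
}

# Keywords named in T3; the figure's "etc" is left unfilled.
CONTACT_KEYWORDS = ("address", "location", "contact")

# Each test maps a page to True or False, mirroring T1..T6.
TESTS: Dict[str, Callable[[PageFeatures], bool]] = {
    "T1": lambda p: p.external_links > 5,
    "T2": lambda p: p.internal_links > 10,
    "T3": lambda p: any(
        k in text.lower() for text in p.link_texts for k in CONTACT_KEYWORDS
    ),
    "T4": lambda p: p.person_names > 3,
    "T5": lambda p: p.has_ticker_symbol,
    "T6": lambda p: any(h.strip().lower().startswith("about") for h in p.headers),
}


def run_tests(page: PageFeatures) -> Dict[str, bool]:
    """Apply every test to one page and return its True/False results."""
    return {name: test(page) for name, test in TESTS.items()}


# Made-up example page, for illustration only.
example = PageFeatures(
    external_links=2,
    internal_links=12,
    link_texts=["Contact us"],
    headers=["About Acme"],
)
example_results = run_tests(example)
```

Keeping each test as an independent predicate means new tests can be added to the dictionary without touching the code that consumes the results.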
Training phase

Training set of Web pages with known contents

Content types for each Web page in the training set

Calculate statistics

Test results for each Web page in the training set



Page   T1   T2   T3   T4
1      T    F    T    F
2      F    T    F    F
3      F    F    T    T
4      F    F    T    T
etc.

P(H=N) = 0.20
P(H=C) = 0.20
etc

P(T1=T|H=N) = 0.4630
P(T1=F|H=N) = 0.5370
P(T1=T|H=C) = 0.2344
P(T1=F|H=C) = 0.7656
etc

P(T2=T|H=N) = 0.2647
P(T2=F|H=N) = 0.7353
P(T2=T|H=C) = 0.6224
P(T2=F|H=C) = 0.3776
etc

P(T3=T|H=N) = 0.7352
P(T3=F|H=N) = 0.2648
P(T3=T|H=C) = 0.2432
P(T3=F|H=C) = 0.7568
etc

etc...

Fig. 2 
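
The "Calculate statistics" step of Fig. 2 can be read as relative-frequency counting over the training set. The sketch below assumes exactly that, with no smoothing, which the figure does not confirm. The train function is a hypothetical name, and the content-type labels attached to the four pages are invented for illustration, because the figure shows the test results for pages 1 to 4 but not their content types.

```python
from collections import Counter, defaultdict


def train(pages):
    """Estimate P(H) and P(T=True | H) by relative frequency.

    pages: list of (content_type, {test_name: bool}) pairs.
    P(T=False | H) is 1 - P(T=True | H).
    """
    type_counts = Counter(label for label, _ in pages)
    total = len(pages)
    priors = {h: n / total for h, n in type_counts.items()}

    # true_counts[h][t] = number of pages of type h on which test t was True
    true_counts = defaultdict(Counter)
    for label, results in pages:
        for test, outcome in results.items():
            if outcome:
                true_counts[label][test] += 1

    tests = sorted({t for _, results in pages for t in results})
    cond_true = {
        h: {t: true_counts[h][t] / type_counts[h] for t in tests}
        for h in type_counts
    }
    return priors, cond_true


# Test results are the four rows shown in Fig. 2; the labels "N" and "C"
# are hypothetical, since the figure does not give them.
training_pages = [
    ("N", {"T1": True, "T2": False, "T3": True, "T4": False}),
    ("C", {"T1": False, "T2": True, "T3": False, "T4": False}),
    ("N", {"T1": False, "T2": False, "T3": True, "T4": True}),
    ("C", {"T1": False, "T2": False, "T3": True, "T4": True}),
]
priors, cond_true = train(training_pages)
```

With only four pages the estimates are coarse and some come out as exactly 0 or 1. A real training set would be much larger, and some form of smoothing may be desirable so that a single unseen outcome does not rule out a content type.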
Classification phase

Subject Web page (unknown content type)

Test results for subject site

Statistics from training phase


P(H=N) = 0.20
P(H=C) = 0.20
etc

P(T1=T|H=N) = 0.4630
P(T1=F|H=N) = 0.5370
P(T1=T|H=C) = 0.2344
P(T1=F|H=C) = 0.7656
etc

P(T2=T|H=N) = 0.2647
P(T2=F|H=N) = 0.7353
P(T2=T|H=C) = 0.6224
P(T2=F|H=C) = 0.3776
etc

P(T3=T|H=N) = 0.7352
P(T3=F|H=N) = 0.2648
P(T3=T|H=C) = 0.2432
P(T3=F|H=C) = 0.7568
etc

etc...


Bayesian Network
(combine test results and calculate confidence level for each candidate type)

Confidence levels for each content type


Content type               Conf Level
N (Company News)           22%
C (Contact information)    4%
P (Product information)    89%
M (Management team)        7%
D (Company description)    92%


Fig. 3 
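
One simple way to combine test results with the training statistics is a naive Bayes calculation, which treats the tests as conditionally independent given the content type. This is an assumption made for the sketch below: the figure names a Bayesian Network but does not give its structure. The classify function is a hypothetical name, the observed test results are made up, and only the types N and C and the tests T1 to T3 are used because those are the only values the figure lists.

```python
# Statistics copied from the figure. Types and tests hidden behind "etc"
# are left out rather than guessed.
PRIORS = {"N": 0.20, "C": 0.20}
P_TRUE = {  # P(test = True | content type)
    "N": {"T1": 0.4630, "T2": 0.2647, "T3": 0.7352},
    "C": {"T1": 0.2344, "T2": 0.6224, "T3": 0.2432},
}


def classify(results, priors=PRIORS, p_true=P_TRUE):
    """Return a normalized posterior over the listed content types.

    results: {test_name: bool} for the subject page.
    Assumes the tests are independent given the type (naive Bayes).
    """
    scores = {}
    for h, prior in priors.items():
        score = prior
        for test, outcome in results.items():
            p = p_true[h][test]
            score *= p if outcome else 1.0 - p
        scores[h] = score
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


# Hypothetical test results for a subject page.
confidence = classify({"T1": True, "T2": False, "T3": True})
```

Because this sketch normalizes over the candidate types, its outputs sum to 1. The confidence levels shown in Fig. 3 do not sum to 100%, so the figure's values are presumably computed per type in some other way that this sketch does not reproduce.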
Preferred embodiment

Input

Test module

Training module

Bayesian Network module

Output

Fig. 4 



